Analyzing Airbnb Prices in New York

Jihyun Kim

Final Project for Data Bootcamp, Fall 2016

What determines each Airbnb's listing price?

Background

Everything in New York is expensive. For first time travelers, New York may seem even more expensive. At the same time, travelers have different wants and needs from their accomodation that a student or a working person would. So I wanted to analyze the price trend of the Airbnb listing prices in the eyes of a traveler.

Travelers of different budget and purpose would have different priorities, but most would definately prefer good accessibility to the top tourist attractions they want to visit. Will this have an effect on the Airbnb rental price?

Data Source

For this data analysis, I used the Airbnb open data avaiable here. I used the listing.csv file for New York.

Libraries and API

Since the csv file contained more than 20,000 entries, I decided to do some basic scrubbing first and then export to a different csv using the csv library. I then used the pandas library to manipulate and display selected data and used the matplotlib and seaborn libraries for visualization. To calculate the average distance from each listing to the top rated tourist attractions of New York, I used the Beautiful Soup library to parse the website and retrieve a list of attraction names. I then used the Google Places API to get each attraction spot's detailed latitude and longitude to calculate the great circle distance from each airbnb apartment.


In [3]:
import sys
import pandas as pd                   
import matplotlib.pyplot as plt      
import datetime as dt                  
import numpy as np                    
import seaborn as sns
import statistics
import csv
from scipy import stats
from bs4 import BeautifulSoup as bs
import urllib.request
from googleplaces import GooglePlaces, types, lang
from geopy.distance import great_circle
import geocoder

%matplotlib inline                     

print('Python version: ', sys.version)
print('Pandas version: ', pd.__version__)
print('Today: ', dt.date.today())


/Users/sglyon/anaconda3/lib/python3.5/site-packages/matplotlib/__init__.py:878: UserWarning: axes.color_cycle is deprecated and replaced with axes.prop_cycle; please use the latter.
  warnings.warn(self.msg_depr % (key, alt_key))
Python version:  3.5.2 |Anaconda 4.2.0 (x86_64)| (default, Jul  2 2016, 17:52:12) 
[GCC 4.2.1 Compatible Apple LLVM 4.2 (clang-425.0.28)]
Pandas version:  0.19.0
Today:  2016-12-22

Google Places API Configuration


In [4]:
apikey = 'AIzaSyAiZn9omCnuF2q89cArLpVfxmlGV7nnjFA'
gplaces = GooglePlaces(apikey)
  1. Write a function to calculate the distance from each listing to top trip advisor attractions
  2. Visualize data including where the closest ones are, the most expensive, the relative borderline
  3. Write a new function that calculates the distance from the closest subwaby stations
  4. somehow visualize the convenience and access from each subway station using Google maps API
  5. decide where is the best value/distance
  6. make a widget that allows you to copy and paste the link

Defining Functions

1. tripadvisor_attractions( url, how_many )

This function takes 2 parameters, the url of the trip advisor link and the number of top attractions one wants to check. It then uses the beautiful soup library to find the div that contains the list of top rated tourist attractions in the city and returns them as a list.


In [5]:
def tripadvisor_attractions(url, how_many):
    page = urllib.request.urlopen(url)
    #using beautiful soup to select targeted div
    soup = bs(page.read(), "lxml")
    filtered = soup.find("div", {"id": "FILTERED_LIST"})
    top_list = filtered.find_all("div", class_="property_title")
    
    sites = []
    #save the text within hyperlink into an empty list
    for site in top_list:
        site = (site.a).text
        site = str(site)
        if not any(char.isdigit() for char in site):
            sites.append(site)
    #splices the list by how many places user wants to include        
    sites = sites[:how_many]
    return sites

2. ta_detail(ta_list, city)

This function takes the list returned by the tripadvisor_attractions() function as well as the city name in a string. I explictly ask for the city name so that Google Places API will find more accurate place details when it looks up each tourist attraction. It returns a dataframe of the tourist attraction, its google place ID, longitude, and latitude.


In [6]:
#ta short for tourist attraction
def ta_detail(ta_list, city):
    ta_df = pd.DataFrame( {'Tourist Attraction' : '',
                           'place_id' : '',
                           'longitude' : '',
                           'latitude' : '' },
                        index = range(len(ta_list)))
    
    for i in range(len(ta_list)):
        query_result = gplaces.nearby_search(
                            location = city,
                            keyword = ta_list[i],
                            radius=20000)
        
        #get only the top first query
        query = query_result.places[0]
        ta_df.loc[i, 'Tourist Attraction'] = query.name
        ta_df.loc[i, 'longitude'] = query.geo_location['lng']
        ta_df.loc[i, 'latitude'] = query.geo_location['lat']
        ta_df.loc[i, 'place_id'] = query.place_id
        
    return ta_df

3. latlong_tuple(ta_df)

This function takes the tourist attraction data frame created above then returns a list of (latitude, longitude) tuples for every one of them.


In [7]:
def latlong_tuple(ta_df):
    tuple_list = []
    for j, ta in ta_df.iterrows():
        ta_geo = (float(ta['latitude']), float(ta['longitude']))
        tuple_list.append(ta_geo)
    return tuple_list

4. clean_csv(data_in, geo_tuples)

This function is the main data scraping function. I tried to first import the csv as a dataframe then clearning each entry, but the pandas iterrow and itertuple took a very long time so I decided to do the basic scrubbing when I was importing the csv. This function automatically saves a new copy of the cleaned csv with a file name extension _out.csv. The function itself doesn't return anything.


In [8]:
def clean_csv(data_in, geo_tuples):
    #automatically generates a cleaned csv file with the same name with _out.csv extension
    index = data_in.find('.csv')
    data_out = data_in[:index] + '_out' + data_in[index:]

    #some error checking when opening
    try:
        s = open(data_in, 'r')
    except:
        print('File not found or cannot be opened')
    else:
        t = open(data_out, 'w')
        print('\n Output from an iterable object created from the csv file')
        
        reader = csv.reader(s)
        writer = csv.writer(t, delimiter=',')
        
        #counter for number or rows removed during filtering
        removed = 0
        added = 0
        
        header = True
        for row in reader:
            if header:
                header = False
                
                for i in range(len(row)):
                    #saving indices for specific columns
                    if row[i] == 'latitude':
                        lat = i
                    elif row[i] == 'longitude':
                        lng = i
                row.append('avg_dist')   
                writer.writerow(row)
            
            #only add the row if the number of reviews is more than 1
            elif(int(row[-1]) > 7):
                #creaing a geo tuple for easy calculation later on
                tlat = row[lat]
                tlng = row[lng]
                ttuple = (tlat, tlng)
                
                dist_calc = []
                
                #calculate the distance from each listing and to every top tourist attractions we saved
                #if the distance is for some reason greater than 100, don't add it as it would skew the result.
                for i in geo_tuples:
                    dist_from_spot = round(great_circle(i, ttuple).kilometers, 2)
                    if (dist_from_spot < 100):
                        dist_calc.append(dist_from_spot)
                    else:
                        print(ta['Tourist Attraction'] + " is too far.")

                #calculates the average distance between the listing and all of the toursist attractions
                avg_dist = round(statistics.mean(dist_calc), 3)
                row.append(avg_dist)
                writer.writerow(row)              
                added += 1
            else:
                removed += 1
        
        s.close()
        t.close()
        print('Function Finished')
        print(added, 'listings saved')
        print(removed, 'listings removed')

Reading in the data: Time for fun!

Reading in the trip advisor url for New York and saving the data

In the cell below, we read in the trip advisor url for New York and save only the top 10 in a list. When we print it, we can validate that these are the famous places New York is famous for.


In [9]:
url = "https://www.tripadvisor.com/Attractions-g60763-Activities-New_York_City_New_York.html"
top_10 = tripadvisor_attractions(url, 10)

In [10]:
print(top_10)


['Central Park', 'The Metropolitan Museum of Art', 'Top of the Rock Observation Deck', 'Manhattan Skyline', 'The High Line', 'Broadway', 'Grand Central Terminal', 'Frick Collection', 'Empire State Building', 'Bryant Park']

In [11]:
ta_df = ta_detail(top_10, 'New York, NY')
geo_tuples = latlong_tuple(ta_df)

ta_df


Out[11]:
Tourist Attraction latitude longitude place_id
0 Central Park 40.7828647 -73.9653551 ChIJ4zGFAZpYwokRGUGph3Mf37k
1 The Metropolitan Museum of Art 40.7794366 -73.963244 ChIJb8Jg9pZYwokR-qHGtvSkLzs
2 Top of The Rock 40.7592454 -73.9793043 ChIJe7vKMf9YwokRIMYfDz7iF9o
3 Manhattan Skyline 40.76361410000001 -73.9780574 ChIJbRAuwPlYwokR829KR7QQLVI
4 The High Line 40.7479925 -74.0047649 ChIJ5bQPhMdZwokRkTwKhVxhP1g
5 Broadway Cameras & Electronics 40.6859984 -73.91624109999999 ChIJWYmGLWtcwokRutgJugsHMu8
6 Grand Central Terminal 40.75272619999999 -73.9772294 ChIJhRwB-yFawokRi0AhGH87UTc
7 The Frick Collection 40.7712286 -73.9672226 ChIJHRH97uxYwokR6CBeDyn1jtg
8 Empire State Building 40.7484405 -73.98566439999999 ChIJaXQRs6lZwokRY6EFpJnhNNE
9 Bryant Park 40.7535965 -73.98323259999999 ChIJvbGg56pZwokRp_E3JbivnLQ

The cell below reads in the original csv file, removes some unwanted listings, and adds a new column that has the average distance from the top 10 Trip Advisor approved(!!) tourist attractions.


In [22]:
clean_csv("data/listings.csv", geo_tuples)


 Output from an iterable object created from the csv file
Function Finished
28455 listings saved
11772 listings removed

We then make a copy dataframe listing to play around with.


In [24]:
df = pd.read_csv('data/listings_out.csv')
print('Dimensions:', df.shape)

df.head()
listing = df.copy()


Dimensions: (28455, 17)

In [25]:
listing.head()


Out[25]:
id name host_id host_name neighbourhood_group neighbourhood latitude longitude room_type price minimum_nights number_of_reviews last_review reviews_per_month calculated_host_listings_count availability_365 avg_dist
0 9147025 Cozy City Island Cottage 403032 Diane Bronx City Island 40.844870 -73.789541 Entire home/apt 125 2 30 2016-11-27 2.61 1 151 18.603
1 7949480 City Island Sanctuary relaxing BR & Bath w Par... 119445 Linda & Didier Bronx City Island 40.852054 -73.788680 Private room 99 1 21 2016-11-21 1.96 1 179 19.100
2 6627449 Large 1 BDRM in Great location 13886510 Arlene Bronx City Island 40.849775 -73.786609 Entire home/apt 125 3 12 2016-10-24 0.69 1 246 19.104
3 1886820 Quaint City Island Community. 9815788 Steve Bronx City Island 40.841144 -73.783052 Entire home/apt 300 7 0 NaN NaN 1 359 18.849
4 16042478 WATERFRONT STUDIO APARTMENT 9117975 Collins Bronx City Island 40.853491 -73.788607 Private room 200 7 0 NaN NaN 1 168 19.195

Visualizing the Data

Neighbourhood

First, I used the groupby function to group the data by neighbourhood groups. I then make 2 different data frames to plot the price and average distance.


In [43]:
area = listing.groupby('neighbourhood_group')
nbhood_price = area['price'].agg([np.sum, np.mean, np.std])
nbhood_dist = area['avg_dist'].agg([np.sum, np.mean, np.std])

In [44]:
fig, ax = plt.subplots(nrows=2, ncols=1, sharex=True)
fig.suptitle('NY Neighbourhoods: Price vs Average Distance to Top Spots', fontsize=10, fontweight='bold')
nbhood_price['mean'].plot(kind='bar', ax=ax[0], color='mediumslateblue')
nbhood_dist['mean'].plot(kind='bar', ax=ax[1], color = 'orchid')

ax[0].set_ylabel('Price', fontsize=10)
ax[1].set_ylabel('Average Distance', fontsize=10)


Out[44]:
<matplotlib.text.Text at 0x1234693c8>

Then I used the groupby function for neighbourhoods to see a price comparison between different New York neighbourhoods


In [45]:
area2 = listing.groupby('neighbourhood')
nb_price = area2['price'].agg([np.sum, np.mean, np.std]).sort_values(['mean'])
nb_dist = area2['avg_dist'].agg([np.sum, np.mean, np.std])

In [46]:
fig, ax = plt.subplots(figsize=(4, 35))
fig.suptitle('Most Expensive Neighbourhoods on Airbnb', fontsize=10, fontweight='bold')
nb_price['mean'].plot(kind='barh', ax=ax, color='salmon')


Out[46]:
<matplotlib.axes._subplots.AxesSubplot at 0x1234abeb8>

The most expensive neighbourhood: Breezy Point

Why is Breezy Point so expensive? Below code displays the Airbnb listings in Breezy Point, which turned out to be the "Tremendous stylish hotel" which was the only listing in Breezy Point.


In [47]:
breezy = listing.loc[listing['neighbourhood'] == 'Breezy Point']
breezy


Out[47]:
id name host_id host_name neighbourhood_group neighbourhood latitude longitude room_type price minimum_nights number_of_reviews last_review reviews_per_month calculated_host_listings_count availability_365 avg_dist
3379 15976657 *Tremendous stylish hotel 15 75432549 Alvin Queens Breezy Point 40.557278 -73.926337 Entire home/apt 853 1 0 NaN NaN 1 180 22.345

The Second Most Expensive: Manhattan Beach

The second most expensive neighbourhood is also not in Manhattan, in contrast to the first visualization we did that showed Manhattan had the highest average Airbnb price. All apartments in Manhattan Beach turns out to be reasonably priced except "Manhattan Beach for summer rent" which costs 2,800 USD per night.

It seems that outliers are skewing the data quite significantly.


In [48]:
beach = listing.loc[listing['neighbourhood'] == 'Manhattan Beach']
beach


Out[48]:
id name host_id host_name neighbourhood_group neighbourhood latitude longitude room_type price minimum_nights number_of_reviews last_review reviews_per_month calculated_host_listings_count availability_365 avg_dist
18530 6179073 Mini Studio in NY city safest area 32052000 Gennady + Laura Brooklyn Manhattan Beach 40.577702 -73.939875 Private room 45 2 66 2016-11-30 3.49 2 333 19.941
18531 15158791 2 BD Modern Luxury Apt. Short Walk to Beach & ... 84558839 Mitch Brooklyn Manhattan Beach 40.580853 -73.939337 Entire home/apt 130 2 7 2016-11-11 2.96 1 323 19.604
18532 13317969 Bright spacious bedroom near beach 56210238 Andrew Brooklyn Manhattan Beach 40.579907 -73.948532 Private room 70 2 15 2016-11-05 2.80 1 179 19.621
18533 6492185 A beautiful and friendly beach area 33929905 Oleg Brooklyn Manhattan Beach 40.578725 -73.952505 Entire home/apt 129 3 22 2016-11-21 1.39 1 161 19.724
18534 13170159 Brooklyn Brand New Studio 62535444 Alex Brooklyn Manhattan Beach 40.578667 -73.952268 Entire home/apt 150 1 5 2016-10-11 0.90 2 364 19.731
18535 1498283 3 BEDROOM DUPLEX PENTHOUSE (HUGE) 7915224 Marina Brooklyn Manhattan Beach 40.579772 -73.940468 Entire home/apt 399 1 1 2015-03-27 0.05 1 365 19.708
18536 13017510 Manhattan Beach for summer rent 71849131 Elena Brooklyn Manhattan Beach 40.579099 -73.952404 Entire home/apt 2800 30 0 NaN NaN 1 365 19.684
18537 15507375 Private Studio close to Manhattan 32052000 Gennady + Laura Brooklyn Manhattan Beach 40.578208 -73.941225 Private room 60 3 1 2016-11-15 1.00 2 330 19.871
18538 13107627 Private Bedroom w/ Private Bath 72542338 Rochette Brooklyn Manhattan Beach 40.580780 -73.953561 Private room 125 3 8 2016-09-25 1.33 1 364 19.489

Room Type

To account for the price difference between room types, I grouped the data by the room_type column and made some visualizations.


In [49]:
area = listing.groupby('room_type')
room_price = area['price'].agg([np.sum, np.mean, np.std])
room_dist = area['avg_dist'].agg([np.sum, np.mean, np.std])

In [50]:
room_price['mean'].plot(title="Average Price by Room Type")


Out[50]:
<matplotlib.axes._subplots.AxesSubplot at 0x12395ec50>

In [51]:
apt = listing.loc[listing['room_type'] == 'Entire home/apt']
apt = apt.sort_values('price', ascending=False)
apt.drop(apt.head(20).index, inplace=True)
apt.head()


Out[51]:
id name host_id host_name neighbourhood_group neighbourhood latitude longitude room_type price minimum_nights number_of_reviews last_review reviews_per_month calculated_host_listings_count availability_365 avg_dist
22149 8515213 Studio 250 New York City 44837740 David Manhattan Theater District 40.756008 -73.989090 Entire home/apt 2000 1 2 2016-11-23 0.18 1 344 2.567
13535 7923681 Lux Penthouse Duplex over 4000SF 14607896 John Manhattan Greenwich Village 40.733531 -73.993083 Entire home/apt 2000 14 0 NaN NaN 1 365 3.997
21292 8736827 Rain or Shine Studios 45837698 Brian Brooklyn Sheepshead Bay 40.585989 -73.927617 Entire home/apt 2000 1 1 2015-10-27 0.07 1 365 19.198
22284 346260 5,000sq ft TriBeCa Loft, 38 windows, 13ft ceil... 1755397 Adam Manhattan Tribeca 40.717345 -74.003730 Entire home/apt 1999 5 1 2012-12-31 0.02 1 192 5.734
7942 5277021 All Star/Fashion weekend/VDAY Pad 351938 Baguettes Brooklyn Columbia St 40.686723 -74.002190 Entire home/apt 1800 1 0 NaN NaN 1 90 8.521

In [52]:
sns.jointplot(x='avg_dist', y="price", data=apt, kind='kde')


/Users/sglyon/anaconda3/lib/python3.5/site-packages/matplotlib/__init__.py:898: UserWarning: axes.color_cycle is deprecated and replaced with axes.prop_cycle; please use the latter.
  warnings.warn(self.msg_depr % (key, alt_key))
/Users/sglyon/anaconda3/lib/python3.5/site-packages/statsmodels/nonparametric/kdetools.py:20: VisibleDeprecationWarning: using a non-integer number instead of an integer will result in an error in the future
  y = X[:m/2+1] + np.r_[0,X[m/2+1:],0]*1j
Out[52]:
<seaborn.axisgrid.JointGrid at 0x12363efd0>

Plotting the Entire Room listings without the top 20 most expensive ones show that there are 2 concentrated correlated areas between average distance and price. The bimodal distribution in average distance might be the concentration of Airbnb listings in Manhattan and Brooklyn


In [36]:
f, ax = plt.subplots(figsize=(11, 6))
sns.violinplot(x="neighbourhood_group", y="price", data=apt, palette="Set3")


Out[36]:
<matplotlib.axes._subplots.AxesSubplot at 0x11ab016d8>

Plotting a violin diagram of the prices of all entire homes in different neighbourhood groups show us that Manhattan has more distrubted price range of apartments, albeit on the higher end, while Queens and Bronx have higher concentration of listings at a specific point at a lower price range.

Dealing with Outliers

To deal with some of the outliers at the top, I tried deleting the top 10 or 20 most expensive ones, but this method wasn't very scalable across the dataset neither was it an accurate depiction of the price variety. So I decided to first get an understanding of the most expensive listings in New York and then to create a separate dataframe that removes data entries with price higher or lower than 3 standard deviations from the mean.


In [53]:
fancy = listing.sort_values('price', ascending=False).iloc[:50]
fancy.head(10)


Out[53]:
id name host_id host_name neighbourhood_group neighbourhood latitude longitude room_type price minimum_nights number_of_reviews last_review reviews_per_month calculated_host_listings_count availability_365 avg_dist
18129 9528920 Quiet, Clean, Lit @ LES & Chinatown 3906464 Amy Manhattan Lower East Side 40.713554 -73.985069 Private room 9999 99 6 2016-01-01 0.47 1 83 5.623
23797 12955683 Great location Apt 35664632 Duan Manhattan Upper West Side 40.800315 -73.966309 Private room 9998 2 2 2016-05-21 0.29 1 336 5.324
15257 10770507 Comfortable one bedroom in Harlem 2506722 Maisha Manhattan Harlem 40.825321 -73.944530 Entire home/apt 9900 1 14 2016-12-01 1.65 1 154 8.407
24409 13156212 Great room in the Upper West Side 21891775 John Manhattan Upper West Side 40.777222 -73.980647 Shared room 9900 1 0 NaN NaN 1 365 3.171
26655 13882124 Mona Liza 82152033 Jose Brooklyn Williamsburg 40.710928 -73.935731 Private room 9000 1 0 NaN NaN 1 358 6.523
6781 2953058 Film Location 1177497 Jessica Doug Brooklyn Clinton Hill 40.691371 -73.967228 Entire home/apt 8000 1 1 2016-09-15 0.37 10 365 7.643
21343 8704144 Lovely Room , 1 Block subway to NYC 15163256 Faigie Brooklyn Sheepshead Bay 40.607583 -73.955418 Private room 6200 2 22 2016-11-13 1.63 3 350 16.534
19679 1448703 Beautiful 1 Bedroom in Nolita/Soho 213266 Jessica Manhattan Nolita 40.721927 -73.993787 Entire home/apt 5000 1 2 2013-09-28 0.05 1 365 5.021
27737 14590978 Creative Event Space in Williamsburg 90483609 Caitlin Brooklyn Williamsburg 40.720590 -73.959366 Entire home/apt 5000 1 0 NaN NaN 1 55 5.088
15740 12376888 2 Bedroom near Times Square - 18341284 Clare Manhattan Hell's Kitchen 40.761262 -73.992730 Entire home/apt 4700 30 5 2016-10-19 0.91 1 153 2.805

In [54]:
fancy.describe()


Out[54]:
id host_id latitude longitude price minimum_nights number_of_reviews reviews_per_month calculated_host_listings_count availability_365 avg_dist
count 5.000000e+01 5.000000e+01 50.000000 50.000000 50.000000 50.00000 50.00000 31.000000 50.000000 50.000000 50.000000
mean 8.332858e+06 2.451062e+07 40.722822 -73.975732 3289.240000 28.28000 4.50000 0.513548 1.580000 289.360000 6.570160
std 5.087089e+06 2.629486e+07 0.056661 0.025196 2546.645762 158.84777 7.90053 0.484524 1.830468 113.478715 4.433811
min 1.474740e+05 2.132660e+05 40.579099 -74.062775 1500.000000 1.00000 0.00000 0.020000 1.000000 17.000000 2.419000
25% 3.500094e+06 3.348941e+06 40.695802 -73.992561 1612.500000 1.00000 0.00000 0.180000 1.000000 180.000000 3.959500
50% 8.609678e+06 1.480857e+07 40.724807 -73.979625 2150.000000 1.00000 1.00000 0.320000 1.000000 359.500000 5.306500
75% 1.304031e+07 3.422357e+07 40.764200 -73.956874 3337.500000 4.50000 5.75000 0.775000 1.000000 365.000000 7.625250
max 1.617308e+07 9.862030e+07 40.825321 -73.927617 9999.000000 1124.00000 43.00000 1.650000 10.000000 365.000000 19.684000

It is likely that some of the listings listed above are specifically for events and photography, rather than for traveler's accomodation. Also it seems like some of the hosts who didn't want to remove their listing from Airbnb but wasn't available to host rather listed the price as 9,900 USD.

Some of the listings that seemed "normal" but had a very high price were:

99.7 percent of the listings

Using simple statistic, I saved a new dataframe named reviewed that has more than 1 review and is at least within 3 standard deviations from the mean.


In [55]:
reviewed = listing.loc[listing['number_of_reviews'] > 1]
reviewed.describe()
reviewed = reviewed[((reviewed['price'] - reviewed['price'].mean()) / reviewed['price'].std()).abs() < 3]
reviewed.describe()


Out[55]:
id host_id latitude longitude price minimum_nights number_of_reviews reviews_per_month calculated_host_listings_count availability_365 avg_dist
count 2.126500e+04 2.126500e+04 21265.000000 21265.000000 21265.000000 21265.000000 21265.000000 21262.000000 21265.000000 21265.000000 21265.000000
mean 7.793968e+06 2.431723e+07 40.730241 -73.954964 137.075758 3.515072 26.407994 1.830831 1.998260 226.872937 7.146000
std 5.113483e+06 2.668143e+07 0.053791 0.042133 94.695809 11.141008 33.544875 1.537712 2.582234 127.619635 3.657168
min 2.515000e+03 1.783000e+03 40.508684 -74.239859 12.000000 1.000000 2.000000 0.030000 1.000000 8.000000 2.275000
25% 3.143311e+06 3.577509e+06 40.691320 -73.984426 72.000000 1.000000 6.000000 0.630000 1.000000 89.000000 4.532000
50% 7.791224e+06 1.322693e+07 40.723944 -73.956331 110.000000 2.000000 14.000000 1.430000 1.000000 285.000000 6.274000
75% 1.289945e+07 3.814982e+07 40.763752 -73.939342 175.000000 3.000000 34.000000 2.660000 2.000000 344.000000 8.852000
max 1.613930e+07 1.047817e+08 40.911003 -73.712936 695.000000 1250.000000 380.000000 14.560000 33.000000 365.000000 35.601000

In [56]:
fig, axs = plt.subplots(1, 2, sharey=True)
fig.suptitle('Do Reviews and Price Matter?', fontsize=20, fontweight='bold')
reviewed.plot(kind='scatter', x='reviews_per_month', y='price', ax=axs[0], figsize=(16, 8))
reviewed.plot(kind='scatter', x='avg_dist', y='price', ax=axs[1])


Out[56]:
<matplotlib.axes._subplots.AxesSubplot at 0x124e73048>

The 2 plots above try to find if there would be any relationship between the number of reviews per month (trust and approval) as well as the average distance from the top attractions. Reviews per month plot does not seem to display any positive correlation between price and user approval, which makes sense as there are many other factors that determine an apartment rental price than user approval.

The average distance plot shows an interesting negative correlation between average distance and price. The lower the average distance is, the higher the price seems to be.

Both graphs show that many hosts like to mark prices discretely, by increments of 5 or 10, as there is a heavy concentration of data along y axis along the grid lines.


In [57]:
f, ax = plt.subplots(figsize=(11, 5))
sns.boxplot(x="neighbourhood_group", y="price", hue="room_type", data=reviewed, palette="PRGn")


Out[57]:
<matplotlib.axes._subplots.AxesSubplot at 0x125d2d668>

The scatterplot above shows how big of a discrepancy apartment prices in Manhattan is. The top 25% of the apartments in Manhattan range in price from 400 USD to more than 700 USD, while those in Bronx span range of just 200 to 300.


In [58]:
reviewed2 = reviewed[((reviewed['price'] - reviewed['price'].mean()) / reviewed['price'].std()).abs() < 2]
sns.jointplot(x='avg_dist', y="price", data=reviewed2, kind='kde')


/Users/sglyon/anaconda3/lib/python3.5/site-packages/matplotlib/__init__.py:898: UserWarning: axes.color_cycle is deprecated and replaced with axes.prop_cycle; please use the latter.
  warnings.warn(self.msg_depr % (key, alt_key))
/Users/sglyon/anaconda3/lib/python3.5/site-packages/statsmodels/nonparametric/kdetools.py:20: VisibleDeprecationWarning: using a non-integer number instead of an integer will result in an error in the future
  y = X[:m/2+1] + np.r_[0,X[m/2+1:],0]*1j
Out[58]:
<seaborn.axisgrid.JointGrid at 0x125d6ccc0>

For a better visualization of the correlation between price and average distance, I plotted another graph with only the 95% of the dataset, (i.e. those with 2 standard deviations within from the mean). This joint plot shows that there are two highly concentrated areas of apartments around 5km away from top tourist attractions on average at price of around 90-100 USD per night, and those around 8km away and 50-60 USD per night.

Conclusion

By looking at several visualizations of the Airbnb data in New York, I was able to find some negative correlation between the average distance away from the most famous sights and price per night. Data grouped by neighbourhood yieled the expected result of the highest average price per night in Manhattan and the lowest in Bronx and Queen. The listings.csv data I used contained a summary of the data so it was not as easy to analyze the detailed factors in determining the price. Moving forward, however, I would love to analyze the detailed version of the open data to identify a more accurate correlation between price and apartment size, availability, reviews, average distace and so on.


In [ ]: